PROJECT 5
MAY 16-2020
Data Analyst Nanodegree
PISA is a survey of students' skills and knowledge as they approach the end of compulsory education. It is not a conventional school test: instead of examining how well students have learned the school curriculum, it looks at how well prepared they are for life beyond school. Around 510,000 students in 65 economies took part in the PISA 2012 assessment of reading, mathematics and science, representing about 28 million 15-year-olds globally. Of those economies, 44 took part in an assessment of creative problem solving and 18 in an assessment of financial literacy.
1. Country
2. Gender
3. Overall_score (score_overall in the data): student's overall score in maths
4. Motivation: student's motivation to study mathematics, including for a future job
5. Anxiety: student's anxiety towards mathematics
6. Interest: student's interest in mathematics
7. Work_ethic: whether students complete their work, including homework, on time and with care
8. Parents: student's perceived view of their parents' attitude towards mathematics
import numpy
import pandas
import matplotlib.pyplot as plt
import seaborn
%matplotlib inline
plt.style.use('fivethirtyeight')
data = pandas.read_csv('Desktop/pisa2012_clean.csv')
data.head()
print(data.shape,'\n')
print(data.info())
data[['country','gender']]=data[['country','gender']].astype('category')
print(data.describe())
We can see that there are 485,490 students in the dataset with 17 features, including country, gender, overall math score, motivation, anxiety, interest, work ethic, behavior, self, and parents. Country and gender are of category type and the remaining features are numeric.
In my analysis, I'm interested in seeing the effect gender has on students' mathematical skills. I wanted to take into consideration the students' math scores, but also the students' and parents' feelings towards mathematics.
I would also like to know whether mathematical skills and attitudes differ between genders across different countries, and whether a student's perceived attitude of their parents differs between genders.
I expect that there will be a higher overall test and subsection score among males compared to that of females. For test scores, we'll be looking at score_overall. I also expect there to be a better attitude towards mathematics among males compared to that of females. For attitudes, we'll be looking at the following features: motivation, anxiety, interest, work_ethic, behavior, and self. For all attitude features, with the exception of 'anxiety', the higher the score is, the better the student's attitude is towards mathematics.
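The expectation above can be checked numerically as well as visually. The following is only a minimal sketch on a hypothetical toy frame (the values are invented, not PISA data); it assumes the same column names used in this notebook and computes the male minus female gap in the mean of each feature.

```python
import pandas as pd

# Hypothetical toy data standing in for the PISA columns used in this notebook
toy = pd.DataFrame({
    'gender': ['Female', 'Male', 'Female', 'Male', 'Female', 'Male'],
    'score_overall': [480.0, 500.0, 520.0, 510.0, 470.0, 530.0],
    'anxiety': [2.6, 2.2, 2.4, 2.0, 2.8, 2.1],
})

# Mean of each feature per gender
by_gender = toy.groupby('gender')[['score_overall', 'anxiety']].mean()

# Male minus female gap: positive means males score higher on that feature
gap = by_gender.loc['Male'] - by_gender.loc['Female']
print(by_gender)
print(gap)
```

A positive gap in score_overall together with a negative gap in anxiety would be in line with the expectation stated above; on the real data the same two lines can be run against `data` instead of `toy`.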
#Overall Score plotting
binsize=50
bins=numpy.arange(0,data['score_overall'].max()+binsize,binsize)
plt.figure(figsize=[12,6])
plt.hist(data= data,x='score_overall',bins=bins)
plt.title('OVERALL SCORES OF MATH', size=20)##title
plt.xlabel('Score of Math')## xaxis
plt.ylabel('No of Students');## yaxis
plt.savefig('hist_score_overall.png')
##Plotting pie graph
plt.pie(data['gender'].value_counts(), labels=data['gender'].value_counts().index,startangle=90,counterclock=False)
plt.axis('square')
plt.title('GENDER of STUDENTS');
print('Male Students: {:.4%}'.format(data['gender'].value_counts()['Male']/data.shape[0]))##Getting the percentage of male students
print('Female Students: {:.4%}'.format(data['gender'].value_counts()['Female']/data.shape[0]))##Getting the percentage of female students
order= data['country'].value_counts().index
##plotting the graph to get no of students on the basis of country they are from
plt.figure(figsize=[20,20])
seaborn.countplot(data=data,y='country',order=order)
plt.title(' STUDENTS by COUNTRY',size=20);
##plotting the box graph
plt.figure(figsize=(20,8))
seaborn.boxplot(x=data['country'].value_counts(), color='red')
seaborn.swarmplot(x=data['country'].value_counts(), color='green')
plt.title('STUDENTS by COUNTRY', size=20);
Most countries have between roughly 5,000 and 7,000 students taking the survey. The boxplot also shows that there are a couple of outliers. For instance, Italy and Mexico have over 30,000 students and Liechtenstein has well below 1,000 students.
##Summary statistics of the number of students per country
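As a rough illustration of how a boxplot decides which points are outliers, the sketch below applies the conventional 1.5 x IQR whisker rule to hypothetical per-country counts. The country labels and numbers are made up for the example and are not the real PISA figures.

```python
import pandas as pd

# Hypothetical per-country student counts (not the real PISA figures)
counts = pd.Series({'A': 5000, 'B': 5500, 'C': 6000, 'D': 6500,
                    'E': 7000, 'F': 33000, 'G': 300})

# Quartiles and interquartile range
q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles fall outside the whiskers
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = counts[(counts < lower) | (counts > upper)]
print(lower, upper)
print(outliers)
```

On the real data, `counts` would be `data['country'].value_counts()`.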
data['country'].value_counts().describe()
#Getting the attitude section
attitude=['motivation', 'anxiety','interest', 'work_ethic', 'behavior', 'self']
fig, ax= plt.subplots(nrows=2,ncols=3,figsize=[20,12])
##binsize is 1/(no of questions per section), listed in the same order as attitude
binsizes=[1/4,1/5,1/4,1/9,1/8,1/5]
ax=ax.flatten()
for i, feature in enumerate(attitude):
    bins=numpy.arange(min(data[feature]), max(data[feature]) +binsizes[i], binsizes[i])
    ax[i].hist(data=data, x=feature, bins=bins)
    ax[i].set_xlabel('Scores')##xlabel
    ax[i].set_ylabel('No of Students')##ylabel
    ax[i].set_title(feature)##title
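The bin size rule used above (1 divided by the number of questions in a section) can be motivated with a small sketch. It assumes, as a hypothesis not stated in the data itself, that each attitude score is the mean of k answers on a 1 to 4 scale; under that assumption the possible scores are spaced exactly 1/k apart, so a bin width of 1/k gives one bin per attainable value.

```python
from itertools import product

import numpy as np

k = 4  # hypothetical number of questions in one attitude section

# Every possible combination of k answers on a 1-4 scale, averaged
means = sorted({sum(answers) / k for answers in product([1, 2, 3, 4], repeat=k)})

# Distance between neighbouring attainable scores
steps = np.diff(means)
print(means)
print(steps)
```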
Under the motivation feature, it looks like the graph is slightly skewed to the left, with spikes at 3-3.25 and 3.75-4 points.
Anxiety is nearly normally distributed.
work_ethic is slightly skewed to the left with a spike at around 3 points.
Behavior is skewed to the right, showing that the majority of students don't take extra actions towards mathematics like talking about math with friends, playing chess, or computer programming.
##To get the parental attitudes
binsize=1/3
bins=numpy.arange(data['parents'].min(),data['parents'].max() +binsize, binsize)
plt.hist(data=data, x='parents', bins=bins)
plt.xlabel('score')##xlabel
plt.ylabel('number of students')##ylabel
plt.title('STUDENT VIEW FOR PARENTAL ATTITUDE TOWARDS MATH');##TITLE
##getting the students whose overall score is at or above the 75th, 90th, 95th and 99th percentiles
data75=data.query('score_overall>=@data.score_overall.quantile(.75)')
data90=data.query('score_overall>=@data.score_overall.quantile(.90)')
data95=data.query('score_overall>=@data.score_overall.quantile(.95)')
data99=data.query('score_overall>=@data.score_overall.quantile(.99)')
##to draw pie graphs of gender for students at or above the 75th, 90th, 95th and 99th percentiles
percent=[75,90,95,99]
subsets=[data75,data90,data95,data99]
fig, ax=plt.subplots(nrows=2, ncols=2, figsize=(10,10))
ax=ax.flatten()
for i in range(4):
    plt.sca(ax[i])
    datae=subsets[i]
    plt.pie(datae['gender'].value_counts(), labels=datae['gender'].value_counts().index,startangle=90,counterclock=False,autopct='%.3f')
    plt.axis('square')
    plt.title('Students at or above the '+ str(percent[i])+'th percentile by gender')
order= data['country'].value_counts().index
##getting the plot on no of students on the basis of their gender and country from where they are.
plt.figure(figsize=[20,40])
seaborn.countplot(data=data,y='country',order=order,hue='gender')
plt.title('Number of Students by Country with Gender',size=35);
##sorting on the basis of country
country=data['country'].unique().tolist()
country.sort()
##grouping female and male students on the basis of their country
female=data.query('gender=="Female"').groupby('country').size()
male=data.query('gender=="Male"').groupby('country').size()
country==female.index.tolist()==male.index.tolist()
datagender=pandas.DataFrame({'country': country, 'female': female.values,'male':male.values})
##To get the female and male percent based on their countries
datagender['total_pop']=datagender['female']+datagender['male']
datagender['female_percent']=100*(datagender['female']/datagender['total_pop'])
datagender['male_percent']=100*(datagender['male']/datagender['total_pop'])
datagender['diff_percent']=(numpy.absolute((datagender['female_percent']-datagender['male_percent'])))
datagender.head(10)##displaying the first 10 rows
datagender['diff_percent'].describe()##summary of the gender percent difference across countries
##plotting swarmplot
seaborn.swarmplot(data=datagender,x='diff_percent',color='red')
seaborn.boxplot(data=datagender, x='diff_percent',color='blue')
plt.title('PERCENT DIFFERENCE of COUNTRY by GENDER')##giving the title
plt.xlabel('percent');##xlabel
datagender.sort_values('diff_percent',ascending=False).head(10)##sorting based on the diff_percent
Here we can see that females are dominated
binsize=25
bins=numpy.arange(min(data['score_overall']),max(data['score_overall'])+binsize, binsize)
##plotting
plt.title('Overall Score of Students')
plt.hist(data=data.query('gender=="Female"'),x='score_overall', alpha=.4,bins=bins,label='Female')##querying female data
plt.hist(data=data.query('gender=="Male"'),x='score_overall', alpha=.4,bins=bins, label='Male')##querying male data
plt.legend();
seaborn.boxplot(data=data, y='parents', x='gender');##plot of parental view
##getting the description of parental view
data.query('gender=="Female"').parents.describe(), data.query('gender=="Male"').parents.describe()
##plotting the histogram
plt.hist(data.query('gender=="Female"')['parents'],bins=numpy.arange(1,4+1/3,1/3), label='Female', alpha=.4)##getting female data
plt.hist(data.query('gender=="Male"')['parents'],bins=numpy.arange(1,4+1/3,1/3), label='Male', alpha=.4)##getting male data
plt.legend()
plt.title('Parental View Of Mathematics On The Basis Of Gender')##title
plt.xlabel('score')##xlabel
plt.ylabel('no of students');##ylabel
##plotting density curve diagram
seaborn.kdeplot(data=data.query('gender=="Female"')['parents'], shade=True, color='green', bw=1/4, label='Female') ##getting female data
seaborn.kdeplot(data=data.query('gender=="Male"')['parents'], shade=True, color='red', bw=1/4, label='Male') ##getting male data
plt.title('Parental View Of Mathematics On The Basis Of Gender')##title
plt.xlabel('score') ##xlabel
plt.ylabel('density'); ##ylabel
#Attitude of students
attitude=['motivation', 'anxiety','interest', 'work_ethic', 'behavior', 'self']
fig, ax= plt.subplots(nrows=2,ncols=3,figsize=[20,12])
#Determining binsizes
#binsizes = 1/(no of questions per sec)
binsizes=[1/4,1/5,1/4,1/9,1/8,1/5]
ax=ax.flatten()
for i, feature in enumerate(attitude):
    bins=numpy.arange(min(data[feature]), max(data[feature]) +binsizes[i], binsizes[i])
    ax[i].hist(data=data.query('gender=="Female"'), x=feature, bins=bins,label='Female', alpha=.4)
    ax[i].hist(data=data.query('gender=="Male"'),x=feature, bins=bins,label='Male', alpha=.4)
    ax[i].set_xlabel('Score')##xlabel
    ax[i].set_ylabel('No of Students')##ylabel
    ax[i].set_title(feature)##title
    ax[i].legend()
plt.savefig('attitudes_gender.png')
##plotting overall score against each attitude score
fig, ax= plt.subplots(nrows=2,ncols=3,figsize=[20,12])
ax=ax.flatten()
for i, feature in enumerate(attitude):
    plt.sca(ax[i])
    seaborn.scatterplot(x=data[feature],y=data['score_overall'],alpha=.05)
    ax[i].set_ylabel('Score: {}'.format('Overall Score'))##ylabel
    ax[i].set_xlabel('Attitude Score')##xlabel
    ax[i].set_title('Overall Score by {}'.format(feature.title()))##title
##analysing with the help of a heat map
seaborn.heatmap(data=data[['score_overall','motivation', 'anxiety','interest', 'work_ethic', 'behavior', 'self']].corr(),
center=0, cmap="RdBu_r",annot=True, vmin=-1, vmax=1);
##Analysing with the help of scatter plot
seaborn.scatterplot(x=data['parents'],y=data['score_overall'],alpha=.05);
##To get the correlation between parents score and overall score
data[['parents','score_overall']].corr()
##analysing using heatmap for parents view
seaborn.heatmap(data=data[attitude+['parents']].corr(), center=0, cmap="RdBu_r",annot=True, vmin=-1, vmax=1);
##getting the kdeplot of motivation per country with bandwidth 1/4
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'motivation', bw=1/4)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').motivation.quantile(.5), color='green')
    ax.axvline(x=data.query('country==@c').motivation.quantile(.75), color='pink', alpha=.75)
    ax.legend()
##getting the kdeplot of anxiety per country with bandwidth 1/5
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'anxiety', bw=1/5)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').anxiety.quantile(.5), color='green')
    ax.axvline(x=data.query('country==@c').anxiety.quantile(.75), color='violet', alpha=.75)
    ax.legend()
## getting kdeplot
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'interest', bw=1/4)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').interest.quantile(.5), color='yellow')
    ax.axvline(x=data.query('country==@c').interest.quantile(.75), color='green', alpha=.75)
    ax.legend()
##plotting the kdeplot
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'work_ethic', bw=1/4)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').work_ethic.quantile(.5), color='red')
    ax.axvline(x=data.query('country==@c').work_ethic.quantile(.75), color='black', alpha=.75)
    ax.legend()
##plotting the kdeplot
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'behavior', bw=.125)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').behavior.quantile(.5), color='yellow')
    ax.axvline(x=data.query('country==@c').behavior.quantile(.75), color='red', alpha=.75)
    ax.legend()
##plotting kdeplot
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'self', bw=1/5)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').self.quantile(.5), color='green')
    ax.axvline(x=data.query('country==@c').self.quantile(.75), color='black', alpha=.75)
    ax.legend()
##getting kdeplot
n=seaborn.FacetGrid(data=data, col='country', hue='gender', col_wrap=4)
n.map(seaborn.kdeplot, 'parents', bw=1/3)
for ax, c in zip(n.axes.flat, country):
    ax.axvline(x=data.query('country==@c').parents.quantile(.5), color='green')
    ax.axvline(x=data.query('country==@c').parents.quantile(.75), color='violet', alpha=.75)
    ax.legend()
LET'S GO AHEAD AND TAKE A LOOK AT THE CHANGES IN OUR DATA AS THE TOP PERCENTILE UNDER OVERALL SCORE INCREASES
##building one summary row per percentile threshold of the overall score
features=['motivation','anxiety','interest','work_ethic','parents','behavior','self']
rows=[]
for i in range(100):
    n=i*.01
    s=data.score_overall.quantile(n)
    datatop=data.query('score_overall>=@s')##students at or above the nth percentile
    counts=datatop.gender.value_counts()
    row={'nth_percentile':n,'mprop':counts['Male']/counts.sum(),'fprop':counts['Female']/counts.sum()}
    dataf=datatop.query('gender=="Female"')
    datam=datatop.query('gender=="Male"')
    for feature in features:
        row['avg_'+feature]=datatop[feature].mean()##all students
        row['m_'+feature]=datam[feature].mean()##male students
        row['f_'+feature]=dataf[feature].mean()##female students
    rows.append(row)
##rows are collected in a list and converted once, since DataFrame.append was removed in pandas 2.0
datae=pandas.DataFrame(rows)
##getting the top part
datae.head()
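To sanity-check the percentile logic used above, here is a small sketch on a hypothetical ten-row frame with invented scores. The helper name `top_share` is made up for this illustration; it keeps the rows at or above a given quantile of score_overall and returns the gender proportions among them.

```python
import pandas as pd

# Hypothetical toy data: alternating genders with increasing scores
toy = pd.DataFrame({
    'gender': ['Female', 'Male'] * 5,
    'score_overall': [300, 350, 400, 450, 500, 550, 600, 650, 700, 750],
})

def top_share(df, q):
    """Gender proportions among rows at or above the q-th quantile of score_overall."""
    threshold = df['score_overall'].quantile(q)
    top = df[df['score_overall'] >= threshold]
    return top['gender'].value_counts(normalize=True)

share_all = top_share(toy, 0.0)   # threshold is the minimum, so every row is kept
share_top = top_share(toy, 0.9)   # only the highest scorer remains
print(share_all)
print(share_top)
```

In this toy frame the single highest scorer is male, so the male share goes from one half at the 0th percentile to the whole group at the 90th, which is the kind of shift the line plots of `mprop` and `fprop` are meant to reveal.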
##getting a 2 lines lineplot
plt.figure(figsize=(10,8))
seaborn.lineplot(data=datae,x='nth_percentile',y='fprop', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='mprop',label='Male')
plt.legend()
plt.title('GENDER PROPORTION OF STUDENTS WITH TOP PERCENTILE')
plt.ylabel('proportion')
plt.xlabel('nth percentile of overall math score');
##To get average motivation score using lineplot
seaborn.lineplot(data=datae,x='nth_percentile',y='f_motivation', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_motivation', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_motivation', label='Average Total')
plt.legend()
plt.title('AVERAGE MOTIVATION SCORE OF STUDENTS WITH TOP PERCENTILE')##title
plt.ylim(bottom=1)##limit
plt.ylabel('motivation score')##ylabel
plt.xlabel('nth percentile with overall math score');##xlabel
##To get average anxiety score using lineplot
seaborn.lineplot(data=datae,x='nth_percentile',y='f_anxiety', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_anxiety', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_anxiety', label='Average Total')
plt.legend()
plt.title('AVERAGE ANXIETY SCORE OF STUDENTS WITH TOP PERCENTILE')##TITLE
plt.ylim(bottom=1)##limit
plt.ylabel('anxiety score')##ylabel
plt.xlabel('nth percentile of overall math score');##xlabel
##To get average interest score using line plot
seaborn.lineplot(data=datae,x='nth_percentile',y='f_interest', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_interest', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_interest', label='Average Total')
plt.legend()
plt.title('AVERAGE INTEREST SCORE OF STUDENTS WITH TOP PERCENTILE')##title
plt.ylim(bottom=1)#limit
plt.ylabel('interest score')##ylabel
plt.xlabel('nth percentile of overall math score');##xlabel
##using two graphs to get a closer view to get average work ethic score
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
seaborn.lineplot(data=datae,x='nth_percentile',y='f_work_ethic', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_work_ethic', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_work_ethic', label='Average Total')
plt.legend()
plt.title('AVERAGE WORK ETHIC SCORE OF STUDENTS WITH TOP PERCENTILE')#title
plt.ylim(bottom=1)##limit
plt.ylabel('work ethic score')##ylabel
plt.xlabel('nth percentile of overall math score')##xlabel
plt.subplot(1,2,2)
seaborn.lineplot(data=datae,x='nth_percentile',y='f_work_ethic', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_work_ethic', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_work_ethic', label='Average Total')
plt.legend()
plt.title(' AVERAGE WORK ETHIC SCORE OF STUDENTS WITH TOP PERCENTILE :CLOSER VIEW')##title
plt.ylabel('work ethic score')##ylabel
plt.xlabel('nth percentile of overall math score');##xlabel
##using two graphs to get a closer view to get average parents score
plt.figure(figsize=(20,8))
plt.subplot(1,2,1)
seaborn.lineplot(data=datae,x='nth_percentile',y='f_parents', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_parents', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_parents', label='Average Total')
plt.legend()
plt.title('AVERAGE PARENTS SCORE FOR STUDENTS WITH TOP PERCENTILE')##title
plt.ylim(bottom=1)##limit
plt.ylabel('parents score')##ylabel
plt.xlabel('nth percentile of overall math score')##xlabel
plt.subplot(1,2,2)
seaborn.lineplot(data=datae,x='nth_percentile',y='f_parents', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_parents', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_parents', label='Average Total')
plt.legend()
plt.title(' AVERAGE PARENTS SCORE FOR STUDENTS WITH TOP PERCENTILE : CLOSER VIEW')##title
plt.ylabel('parent score')##YLABEL
plt.xlabel('nth percentile of overall math score');##XLABEL
##using lineplot to get average behaviour score
seaborn.lineplot(data=datae,x='nth_percentile',y='f_behavior', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_behavior', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_behavior', label='Average Total')
plt.legend()
plt.title('AVERAGE BEHAVIOUR SCORE OF STUDENTS WITH TOP PERCENTILE')##TITLE
plt.ylim(bottom=1)##limit
plt.ylabel('behavior score')##ylabel
plt.xlabel('nth percentile of overall math score');##xlabel
##using lineplot to get average self score
seaborn.lineplot(data=datae,x='nth_percentile',y='f_self', label='Female')
seaborn.lineplot(data=datae,x='nth_percentile',y='m_self', label='Male')
seaborn.lineplot(data=datae,x='nth_percentile',y='avg_self', label='Average Total')
plt.legend()
plt.title('AVERAGE SELF SCORE OF STUDENTS WITH TOP PERCENTILE')##title
plt.ylim(bottom=1)##limit
plt.ylabel('self score')##ylabel
plt.xlabel('nth percentile of overall math score');##xlabel
After this analysis, we can see that the average self score for males is consistently higher than the average self score for females. There is a large gap between how females feel about their math skills and how males feel about theirs.